mean_absolute_error (MAE)#
Mean Absolute Error (MAE) measures the average absolute difference between targets and predictions.
If \(y\) is the true target and \(\hat{y}\) is the prediction, MAE answers:
“On average, how many units am I off?”
It is one of the most interpretable regression metrics because it is expressed in the same units as the target.
Learning goals#
Define MAE precisely (with math)
Build intuition with plots
Implement MAE from scratch in NumPy (including weights / multi-output)
Use MAE as an optimization objective (L1 / median regression) with subgradient descent
Understand pros/cons and when to use MAE
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Definition#
For \(n\) samples, MAE is:

\[
\mathrm{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\]
Let the residual be \(r_i = \hat{y}_i - y_i\). Then MAE is the mean of \(|r_i|\).
Equivalently (vector form for 1D targets): \(\mathrm{MAE}(y, \hat{y}) = \frac{1}{n}\lVert y - \hat{y}\rVert_1\).
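A quick numeric check of the equivalence (the values here are illustrative):

# Mean of absolute errors equals the L1 norm of the residual divided by n
y_demo = np.array([1.0, 2.0, 3.0])
p_demo = np.array([1.5, 1.0, 3.0])
np.abs(y_demo - p_demo).mean(), np.linalg.norm(y_demo - p_demo, ord=1) / len(y_demo)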
Weighted MAE#
With non-negative sample weights \(w_i\):

\[
\mathrm{MAE}_w(y, \hat{y}) = \frac{\sum_{i=1}^{n} w_i \left| y_i - \hat{y}_i \right|}{\sum_{i=1}^{n} w_i}
\]
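A one-line sketch of the weighted formula (illustrative values); np.average handles the division by \(\sum_i w_i\):

# Weighted MAE as a weighted average of absolute errors
w_demo = np.array([1.0, 1.0, 2.0])  # heavier weight on the last sample
y_w = np.array([1.0, 2.0, 3.0])
p_w = np.array([1.5, 2.0, 4.0])
np.average(np.abs(y_w - p_w), weights=w_demo)  # (1*0.5 + 1*0.0 + 2*1.0) / 4 = 0.625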
Multi-output targets#
If \(y \in \mathbb{R}^{n \times m}\) (multiple outputs), you typically compute MAE per output and then aggregate (e.g. uniform average).
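A minimal sketch of the per-output computation (array values are illustrative):

# Multi-output MAE: per-output means, then a uniform average
Y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
Y_pred = np.array([[1.5, 2.0], [2.0, 5.0]])
per_output = np.abs(Y_true - Y_pred).mean(axis=0)  # MAE for each output column
per_output, per_output.mean()  # "raw_values" vs "uniform_average"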
2) Intuition: MAE is “average distance”#
Absolute error is a distance on the number line:
it never cancels out (negative errors do not offset positive ones)
it is linear: doubling an error doubles its contribution
it is in the same units as \(y\) (e.g. dollars, °C, minutes)
A useful mental model is “typical miss size”.
# A tiny example
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
residuals = y_pred - y_true
abs_errors = np.abs(residuals)
mae_manual = abs_errors.mean()
mae_sklearn = mean_absolute_error(y_true, y_pred)
residuals, abs_errors, mae_manual, mae_sklearn
(array([-0.5, 0.5, 0. , 1. ]), array([0.5, 0.5, 0. , 1. ]), 0.5, 0.5)
Visual: each absolute error is a segment length#
For each sample \(i\), MAE takes the length of the segment between \(y_i\) and \(\hat{y}_i\), then averages those lengths.
idx = np.arange(len(y_true))
mid = 0.5 * (y_true + y_pred)
fig = go.Figure()
fig.add_trace(go.Scatter(x=idx, y=y_true, mode="markers", name="y_true", marker=dict(size=10)))
fig.add_trace(go.Scatter(x=idx, y=y_pred, mode="markers", name="y_pred", marker=dict(size=10)))
for i in idx:
fig.add_shape(
type="line",
x0=int(i), x1=int(i),
y0=float(y_true[i]), y1=float(y_pred[i]),
line=dict(color="gray", width=2),
)
fig.add_trace(
go.Scatter(
x=idx,
y=mid,
mode="text",
text=[f"|e|={e:.2f}" for e in abs_errors],
showlegend=False,
textposition="middle right",
)
)
fig.update_layout(
title=f"Absolute errors per sample (MAE = {mae_manual:.3f})",
xaxis_title="sample index i",
yaxis_title="value",
)
fig.show()
3) MAE vs MSE: different penalty shapes#
Both MAE and MSE summarize residuals \(r = \hat{y} - y\), but they penalize them differently:
MAE uses \(|r|\) (linear penalty)
MSE uses \(r^2\) (quadratic penalty)
So MSE puts much more weight on large errors, while MAE is more robust to outliers.
Subgradient of the absolute value#
The absolute value is not differentiable at 0, but it has a subgradient:

\[
\partial \lvert r \rvert =
\begin{cases}
\{-1\} & r < 0 \\
[-1, 1] & r = 0 \\
\{+1\} & r > 0
\end{cases}
\]

In code we typically use np.sign(r), which returns 0 for exactly-zero residuals (a valid subgradient choice).
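A quick check of that behavior (illustrative values):

# np.sign picks a valid subgradient everywhere, including 0 at r = 0
np.sign(np.array([-2.0, 0.0, 3.5]))  # -> array([-1.,  0.,  1.])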
r = np.linspace(-5, 5, 401)
l1 = np.abs(r)
l2 = r**2
# Huber (smooth L1) is often used as a differentiable alternative
delta = 1.0
huber = np.where(np.abs(r) <= delta, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))
fig = go.Figure()
fig.add_trace(go.Scatter(x=r, y=l1, name="MAE pointwise: |r|", line=dict(width=3)))
fig.add_trace(go.Scatter(x=r, y=l2, name="MSE pointwise: r²", line=dict(width=3)))
fig.add_trace(go.Scatter(x=r, y=huber, name="Huber (smooth L1)", line=dict(width=3, dash="dot")))
fig.update_layout(
title="Pointwise loss vs residual r",
xaxis_title="residual r = ŷ - y",
yaxis_title="loss for a single sample",
)
fig.show()
Outlier sensitivity: one bad point#
Suppose \(n-1\) predictions are perfect (\(r=0\)) and one sample has residual \(m\). Then:

\[
\mathrm{MAE} = \frac{|m|}{n}, \qquad \mathrm{MSE} = \frac{m^2}{n}
\]
Even though both are averaged, MSE still grows quadratically in \(|m|\), so a single large error can dominate it.
n_samples = 50
m = np.linspace(0, 20, 401)
mae_one = m / n_samples
mse_one = (m**2) / n_samples
fig = go.Figure()
fig.add_trace(go.Scatter(x=m, y=mae_one, name="MAE (one outlier)", line=dict(width=3)))
fig.add_trace(
go.Scatter(
x=m,
y=mse_one,
name="MSE (one outlier)",
yaxis="y2",
line=dict(width=3, dash="dash"),
)
)
fig.update_layout(
title=f"Effect of 1 outlier among n={n_samples} samples",
xaxis_title="outlier magnitude |m|",
yaxis_title="MAE (same units as y)",
yaxis2=dict(title="MSE (squared units)", overlaying="y", side="right"),
)
fig.show()
4) A key property: MAE is minimized by the median#
Suppose your model can only output a constant prediction \(c\) (no features). The objective becomes:

\[
J(c) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - c \right|
\]
The value(s) of \(c\) that minimize \(J(c)\) are the median(s) of \(y\).
Intuition: moving \(c\) to the right by a small \(\varepsilon\) increases the loss by \(\varepsilon\) for every point to the left of \(c\) and decreases it by \(\varepsilon\) for every point to the right. At the optimum those counts balance, which is exactly the definition of a median.
y = rng.normal(loc=0.0, scale=1.0, size=250)
y[:6] += 8 # a few big outliers
c_grid = np.linspace(y.min() - 1, y.max() + 1, 500)
mae_c = np.mean(np.abs(y[:, None] - c_grid[None, :]), axis=0)
c_argmin = float(c_grid[np.argmin(mae_c)])
y_median = float(np.median(y))
y_mean = float(y.mean())
fig = go.Figure()
fig.add_trace(go.Scatter(x=c_grid, y=mae_c, name="MAE(c)", line=dict(width=3)))
fig.add_vline(x=y_median, line=dict(color="green", dash="dash"), annotation_text="median(y)")
fig.add_vline(x=y_mean, line=dict(color="red", dash="dot"), annotation_text="mean(y)")
fig.add_vline(x=c_argmin, line=dict(color="black"), annotation_text="argmin")
fig.update_layout(
title="For a constant predictor, MAE is minimized at the median",
xaxis_title="constant prediction c",
yaxis_title="MAE(c)",
)
fig.show()
y_mean, y_median, c_argmin
(0.14334962923664443, -0.06675038086235976, -0.07859386281040903)
5) NumPy implementation from scratch#
Below is a small NumPy implementation that matches scikit-learn’s behavior for:
sample_weight (weighted average over samples)
multi-output targets with multioutput:
  "raw_values" (per-output MAE)
  "uniform_average" (average of per-output MAEs)
  an array of output weights
def mean_absolute_error_np(y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")
if y_true.ndim == 1:
abs_err = np.abs(y_true - y_pred)
if sample_weight is None:
return float(abs_err.mean())
w = np.asarray(sample_weight)
if w.shape != abs_err.shape:
raise ValueError(f"sample_weight must have shape {abs_err.shape}, got {w.shape}")
if np.any(w < 0):
raise ValueError("sample_weight must be non-negative")
return float(np.sum(w * abs_err) / np.sum(w))
if y_true.ndim != 2:
raise ValueError("y_true must be 1D or 2D")
abs_err = np.abs(y_true - y_pred) # (n_samples, n_outputs)
if sample_weight is None:
output_errors = abs_err.mean(axis=0)
else:
w = np.asarray(sample_weight)
if w.ndim != 1 or w.shape[0] != abs_err.shape[0]:
raise ValueError(
f"sample_weight must have shape (n_samples,), got {w.shape} for n_samples={abs_err.shape[0]}"
)
if np.any(w < 0):
raise ValueError("sample_weight must be non-negative")
output_errors = np.sum(abs_err * w[:, None], axis=0) / np.sum(w)
if multioutput == "raw_values":
return output_errors
if multioutput == "uniform_average":
return float(output_errors.mean())
weights = np.asarray(multioutput)
if weights.shape != (abs_err.shape[1],):
raise ValueError(
f"multioutput weights must have shape (n_outputs,), got {weights.shape} for n_outputs={abs_err.shape[1]}"
)
if np.any(weights < 0):
raise ValueError("multioutput weights must be non-negative")
return float(np.average(output_errors, weights=weights))
# Quick check vs scikit-learn
y_true_2d = rng.normal(size=(12, 3))
y_pred_2d = y_true_2d + rng.normal(scale=0.2, size=(12, 3))
w = rng.uniform(0.5, 2.0, size=12)
ours = mean_absolute_error_np(y_true_2d, y_pred_2d, sample_weight=w, multioutput="raw_values")
sk = mean_absolute_error(y_true_2d, y_pred_2d, sample_weight=w, multioutput="raw_values")
ours, sk, np.allclose(ours, sk)
(array([0.1673, 0.1563, 0.1535]), array([0.1673, 0.1563, 0.1535]), True)
6) Using MAE to fit a model: L1 regression (median regression)#
If we use MAE as the training objective, we get L1 regression. For a linear model

\[
\hat{y}_i = x_i^\top w + b,
\]

the MAE objective is

\[
J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left| x_i^\top w + b - y_i \right| = \frac{1}{n} \lVert Xw + b - y \rVert_1.
\]
This is convex, but not differentiable everywhere. A common low-level optimizer is subgradient descent.
Let \(r = Xw + b - y\) and \(s = \mathrm{sign}(r)\) (using 0 when \(r_i = 0\)). One valid subgradient is:

\[
g_w = \frac{1}{n} X^\top s, \qquad g_b = \frac{1}{n} \sum_{i=1}^{n} s_i
\]
Statistical interpretation: minimizing MAE corresponds to maximum likelihood under Laplace (double exponential) noise, and it targets the conditional median (robust to outliers).
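To make the Laplace connection concrete: if \(y_i = \mu_i + \epsilon_i\) with \(\epsilon_i \sim \mathrm{Laplace}(0, \sigma)\), the noise density is \(p(\epsilon) = \frac{1}{2\sigma} e^{-|\epsilon|/\sigma}\), so the negative log-likelihood of the data is

\[
-\log L(w, b) = n \log(2\sigma) + \frac{1}{\sigma} \sum_{i=1}^{n} \left| y_i - (x_i^\top w + b) \right|,
\]

which, for fixed \(\sigma\), is minimized exactly where the MAE is minimized.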
# Synthetic regression data with outliers
n = 220
x = rng.uniform(0, 10, size=n)
X = x[:, None]
beta0_true = 1.0
beta1_true = 2.0
# Laplace noise matches the MAE/L1 modeling assumption
noise = rng.laplace(loc=0.0, scale=1.0, size=n)
y = beta0_true + beta1_true * x + noise
# Inject a few large outliers
outlier_idx = rng.choice(n, size=7, replace=False)
y[outlier_idx] += rng.normal(loc=25.0, scale=5.0, size=outlier_idx.size)
outlier_idx[:5], x.min(), x.max()
(array([131, 65, 31, 24, 118]), 0.054298341287004614, 9.991047304742386)
# Fit ordinary least squares (OLS): minimizes MSE, not MAE
ols = LinearRegression().fit(X, y)
y_pred_ols = ols.predict(X)
mae_ols = mean_absolute_error(y, y_pred_ols)
mse_ols = mean_squared_error(y, y_pred_ols)
(ols.intercept_, ols.coef_[0]), (mae_ols, mse_ols)
((1.5939862944862195, 2.034722899043242),
(1.9884324902995665, 21.556270780903915))
def fit_linear_regression_mae_subgradient(X, y, *, lr0=0.8, n_iters=3000):
"""Minimize MAE for y_hat = X @ w + b using subgradient descent.
Uses a decaying learning rate: lr_t = lr0 / sqrt(t+1).
"""
X = np.asarray(X)
y = np.asarray(y)
n_samples, n_features = X.shape
w = np.zeros(n_features)
b = float(np.median(y)) # good starting point when w=0
history = np.empty(n_iters)
for t in range(n_iters):
y_hat = X @ w + b
r = y_hat - y
s = np.sign(r) # subgradient of |r|
grad_w = (X.T @ s) / n_samples
grad_b = s.mean()
lr = lr0 / np.sqrt(t + 1)
w -= lr * grad_w
b -= lr * grad_b
history[t] = np.mean(np.abs(r))
return w, b, history
# Feature scaling improves optimization stability
x_mean = float(x.mean())
x_std = float(x.std())
X_scaled = ((x - x_mean) / x_std)[:, None]
w_scaled, b_scaled, hist = fit_linear_regression_mae_subgradient(X_scaled, y)
# Convert back to original x scale
w_mae = w_scaled[0] / x_std
b_mae = b_scaled - w_scaled[0] * x_mean / x_std
y_pred_mae = w_mae * x + b_mae
mae_mae = mean_absolute_error(y, y_pred_mae)
mse_mae = mean_squared_error(y, y_pred_mae)
(b_mae, w_mae), (mae_mae, mse_mae)
((1.0142419496584196, 2.001810613820674),
(1.8120687855249078, 22.118168492697624))
fig = px.line(
y=hist,
title="Subgradient descent: MAE objective vs iteration",
labels={"index": "iteration", "y": "train MAE"},
)
fig.show()
x_line = np.linspace(x.min(), x.max(), 250)
y_line_true = beta0_true + beta1_true * x_line
y_line_ols = ols.intercept_ + ols.coef_[0] * x_line
y_line_mae = b_mae + w_mae * x_line
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=x, y=y, mode="markers", name="data",
marker=dict(size=7, color="rgba(0,0,0,0.55)"),
)
)
fig.add_trace(
go.Scatter(
x=x[outlier_idx], y=y[outlier_idx], mode="markers", name="outliers",
marker=dict(size=10, color="crimson", symbol="x"),
)
)
fig.add_trace(
go.Scatter(
x=x_line, y=y_line_true, mode="lines", name="true line",
line=dict(color="green", dash="dash"),
)
)
fig.add_trace(
go.Scatter(
x=x_line, y=y_line_ols, mode="lines",
name=f"OLS (MSE) fit | MAE={mae_ols:.2f}",
line=dict(width=3),
)
)
fig.add_trace(
go.Scatter(
x=x_line, y=y_line_mae, mode="lines",
name=f"L1 (MAE) fit | MAE={mae_mae:.2f}",
line=dict(width=3),
)
)
fig.update_layout(
title="Outliers pull OLS more than L1/MAE regression",
xaxis_title="x",
yaxis_title="y",
)
fig.show()
7) Practical usage (scikit-learn)#
MAE is typically used as an evaluation metric on validation/test sets and for model selection.
In scikit-learn:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_true, y_pred)
For cross-validation, scikit-learn uses a “bigger is better” convention, so it exposes MAE as negative MAE:
cross_val_score(model, X, y, scoring="neg_mean_absolute_error")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_test_pred = model.predict(X_test)
test_mae = mean_absolute_error(y_test, y_test_pred)
test_mae
1.9732338388900619
scores = cross_val_score(LinearRegression(), X, y, scoring="neg_mean_absolute_error", cv=5)
mae_scores = -scores
mae_scores, mae_scores.mean()
(array([2.1155, 2.1505, 2.1875, 2. , 1.6525]), 2.021217652798245)
Pros, cons, and when to use MAE#
Pros#
Interpretable: same units as the target (easy to communicate)
Robust-ish to outliers: large errors grow linearly, not quadratically
Median-targeting: minimizing MAE targets the conditional median (useful when outliers/heavy tails exist)
Cons#
Not differentiable at 0: requires subgradients or smooth approximations (Huber / smooth L1)
Doesn’t emphasize large errors: if big misses are especially costly, MAE may under-penalize them
Scale-dependent: MAE values can’t be compared across targets with different units/scales
Good use cases#
When the cost of an error grows roughly linearly with its magnitude
When your data has outliers or heavy-tailed noise and you want a stable “typical error” number
When you care about the median behavior rather than the mean
Common pitfalls / diagnostics#
If \(y\) has a wide range, MAE can look “large” even for good models; compare to a baseline such as always predicting the median (see the sketch after this list).
MAE alone hides error distribution; pair it with a plot of residuals or a percentile-based metric.
If your business cost is asymmetric (over-prediction vs under-prediction differ), consider pinball loss / quantile regression instead of plain MAE.
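A minimal baseline sketch, reusing y_train, y_test, model, and X_test from the cells in Section 7:

# Baseline sanity check: a model should clearly beat a constant median
# predictor; otherwise its MAE number is not meaningful on its own
baseline_pred = np.full_like(y_test, np.median(y_train))
mae_baseline = mean_absolute_error(y_test, baseline_pred)
mae_model = mean_absolute_error(y_test, model.predict(X_test))
print(f"baseline MAE: {mae_baseline:.3f} | model MAE: {mae_model:.3f}")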
Exercises#
Prove that the minimizer of \(\sum_i |y_i - c|\) is any median of \(y\).
Modify mean_absolute_error_np to support an axis argument.
Compare MAE vs RMSE on the same dataset as you add more outliers. What changes first?
References#
scikit-learn: sklearn.metrics.mean_absolute_error
Robust regression / L1 loss: connection to Laplace noise and median regression